llama : disable graph reuse with pipeline parallelism #20463
ORippler
left a comment
Let's take the time to understand what's going on here
hi, starting from this version, due to this substantial regression, I am currently sticking with the last known functional version prior to this change. I have not yet tested other models to see whether this issue is isolated to this specific architecture or configuration.
@ggerganov thanks for the repro.
In my setup (RTX PRO 6000, 4500), I get the following PPLs:
Regarding the graph reuse logic, I need to dig deeper. According to the PPL results above, just disabling the incorrect part of my PR is enough. Did you test on workstation GPUs too, or were you also using other GPUs (consumer/GeForce or AMD)?
With further analysis, I need to retract my first assumption. I rather think that the event-based scheduling mechanism implicitly relied on the synchronizations in the blocking CPU->GPU copy operations, and that removing them surfaced a pre-existing bug / blurry edge case in the event-based scheduling mechanism. I briefly tested the assumption that the event-based scheduling mechanism for pipeline parallelism implicitly requires stream synchronizations similar to the single-GPU case. I will open a new draft PR for further discussion of this.
The following repro demonstrates the issue:
```sh
make -j && ./bin/llama-perplexity -hf ggml-org/Qwen3-0.6B-GGUF -f wiki.test.raw --chunks 16 -ngl 99 -ub 512 -b 2048
```

PPL = 2382.4719 +/- 246.20903

The problem seems to occur when 2 consecutive pp ubatches both output logits, which happens in perplexity runs with the above parameters: 4 ubatches of size 512, where the second 2 ubatches output logits. I think the graph reuse logic somehow conflicts with the scheduler's logic for tracking the current copy:
llama.cpp/ggml/src/ggml-backend.cpp, lines 1775 to 1780 in 557fe2d
For now disabling graph reuse when pipeline parallelism is active to workaround. Proper investigation is necessary.
Additionally, after #17795, disabling graph reuse alone is no longer enough to fix the issue, so for now that change is also being reverted.
Note that both commits in this PR are needed. Neither one fixes the issue alone.
cc @aendk @gaugarg-nv